---
layout: page
title: Multivariate Regression
permalink: /Rcoding/coding4/
parent: R Coding
nav_order: 4
---

Multivariate Regression


Main Commands Introduced in this Session

na.omit() #omit observations with missing data
merge() #merge two objects
as.data.frame() #convert object to data frame
relevel() #change the factor level of variable
scatterplot3d() #create a scatterplot in 3d-design
residuals() #extract residual values from model objects
stargazer() #create well-formatted regression tables

Multivariate Regression

Previously you learned how to make predictions of the values of a dependent variable based on the values of an independent variable. Remember when we predicted the Leave vote share across constituencies based on the mean age? Or the Leave vote share based on the percentage of foreign-born residents? In both cases, we had one variable to predict the Leave vote share. We call these variables predictors, regressors or independent variables. What was common to both of our models was that the variable we were trying to predict (the regressand or dependent variable) was the Leave vote share. Both age and the share of foreign-born residents seemed to be strongly related to the dependent variable.

But it might well be that there is an omitted factor that relates to both the Leave vote share and the mean age (or the share of foreigners)! Think of ice cream sales and deaths by drowning. You are likely to find a strong correlation between the two, but are ice cream sales really causing deaths by drowning? No: both are correlated with summer temperatures. Or you might be interested in proving that going to the doctor makes you healthier one year later. After collecting your data, you find no effect. But this might be because people who go to the doctor may be less healthy in the first place, so we’re not comparing like with like. When you take prior health conditions into account, you are more likely to find a positive effect of visiting the doctor on future health outcomes.

In regression analysis, we control for factors that are related to both the dependent and the independent variable. We say that you control or adjust for a variable when you add one or more variables to your model, so that these can be held constant when you interpret the relationship between your dependent and independent variables of interest. In the above example, we need to control for prior health conditions, as they are related to both the likelihood of going to the doctor and future health conditions. Likewise, when predicting deaths by drowning, you add summer temperature to your model, as it is correlated with both deaths by drowning and ice-cream sales. In both cases you identify a confounder: something that is causally related to both the dependent and independent variables.
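To see confounding in action, here is a minimal simulation (not part of the Brexit data; all variable names are made up): summer temperature drives both ice-cream sales and drownings, so the bivariate slope on ice cream is spurious and shrinks towards zero once temperature is controlled for.

```r
# Simulated confounding: temperature causes both variables,
# and ice cream has no true effect on drownings.
set.seed(42)
temperature <- rnorm(500, mean = 20, sd = 5)
icecream    <- 2 * temperature + rnorm(500)
drownings   <- 0.5 * temperature + rnorm(500)

coef(lm(drownings ~ icecream))               # spurious positive slope
coef(lm(drownings ~ icecream + temperature)) # ice cream slope near zero
```

The second regression "adjusts for" temperature: it compares observations with similar temperatures, and within those, ice cream no longer predicts drownings.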

• A confounder is a pre-treatment variable that influences both your dependent variable \(Y\) and your main independent variable of interest \(X\).

• An omitted variable is any variable that is omitted from the regression, whether pre- or post-determined.

Remember that in bivariate linear regression analysis, we fit a single line through a scatter plot which minimises the sum of squared residuals. Similarly, in multivariate linear regression analysis, we fit a (hyper)plane through a multi-dimensional cloud of data points - and this (hyper)plane is, by a similar logic, the one that minimises the sum of squared residuals. The simplest form of multivariate regression has one dependent and two independent variables. You can increase the number of your independent variables as long as: their inclusion is motivated by theory; they are predetermined (i.e. they are observable before the treatment occurs/the outcome is observed); they can be expected to correlate with both the dependent variable and the independent variable of interest; and there are sufficient degrees of freedom (simply speaking, the number of observations is higher than the number of independent variables in the model).

Last week, our models where age predicted the Leave vote had some unexplained variation in the dependent variable – that is, there was variation in our dependent variable that was not explained by our independent variable. By adding one or more independent variables, also known as covariates, you reduce the sum of the squared residuals and thus your R-squared increases. However, the goal of multivariate regression is to address confounding factors, so that we can better estimate the effect our independent variable of interest has on the dependent variable. Thus, an increase in the R squared does not in itself necessarily mean a better regression model, as it is no indication that we are actually explaining the effect of a given predictor \(X\) on our outcome \(Y\). In the social sciences, we are typically not so much interested in predicting or fully determining \(Y\), but rather in estimating possible connections between \(X\) and \(Y\). This is why assessing the relationship between \(X\) and \(Y\), and the confidence we have in its causality is our chief aim - we are not trying to increase the R squared! The equation for multivariate regression is similar to that for bivariate regression, with the difference that you add multiple betas and \(X_p\) independent variables:

\[ Y_i=\alpha+\beta_1 X_{i 1}+\beta_2 X_{i 2}+\ldots+\beta_p X_{i p}+\epsilon_i \]

where \(Y_{i}\) is a vector of the values of the dependent variable for each observation \(i\); \(X_{i1}\) through \(X_{ip}\) are the values for each observation \(i\) of distinct independent variables; \(\alpha\) is the predicted value of \(Y\) ( \(\hat{Y}\) ) when all of the independent variables (\(X_1\) through \(X_p\)) are equal to zero; \(\beta_1\) through \(\beta_p\) are the estimated regression coefficients; \(\epsilon_i\) is the idiosyncratic error term. Each regression coefficient represents the change in \(\hat{Y}\) relative to a one unit change in the respective independent variable holding other variables constant. In the multiple regression situation, \(\beta_1\), for example, is the change in \(Y\) relative to a one unit change in \(X_1\), holding all other independent variables constant (i.e., when the remaining independent variables are held at the same value or are fixed).

In R, the function for multivariate linear regression is lm(), with each predictor being added to the right of the formula with a plus sign:

lm(data = ... , dependent variable ~ independent variable_1
   + independent variable_2 + independent variable_3)
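A runnable sketch on simulated data (all variable names hypothetical) shows this formula interface recovering known coefficients:

```r
# Simulated data with known coefficients: intercept 1, slopes 0.5 and -2
set.seed(5)
x1 <- rnorm(200)
x2 <- rnorm(200)
y  <- 1 + 0.5 * x1 - 2 * x2 + rnorm(200)

fit <- lm(y ~ x1 + x2)
coef(fit)  # estimates should land close to 1, 0.5 and -2
```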

Working with Observational Data

Let’s return to our data set from before. Remember how we found that local authorities in London “stood out” as outliers due to their especially low share of the Leave vote? Now we are going to explore further regional variation, and see how it combines with other variables to explain the distribution of the Leave vote. We start by loading in the data (saving the dataframe for now as brexit_complete), and installing the packages scatterplot3d and stargazer, which we are going to need later on:

setwd("/Users/your_username/Documents/DataAnalysisR/script4")
brexit_complete <- read.csv("brexit_data.csv")

install.packages("scatterplot3d")
install.packages("stargazer")
library("scatterplot3d")
library("stargazer")

Here is a description of the variables we are going to work with today:

| Variable Name | Variable Description |
|---|---|
| Area | Name of the local authority |
| Region | Name of the region that contains the local authority |
| Percent_Leave | Percentage of Leave votes |
| age_mean | The mean age of the area |
| Bachelors_deg_percent | The percentage of people with bachelor’s degrees in the area |
| Health_very_good_percent | The percentage of people with very good health in the area |
Remember how the NAs (missing data) kept getting in the way? Today, for simplicity, we just get rid of them, creating a new dataframe called simply brexit only with the observations with all complete entries:

brexit <- na.omit(brexit_complete)

When we remove observations, we always want to pay attention to which observations we’re removing. Are all local authorities equally likely to have missing data? Let’s look at the Region variable in the original dataset and in the new dataset:

table(brexit_complete$Region)
table(brexit$Region)

We can use the merge() function to see the count for each region from the two tables side by side. First, we want to turn the tables into dataframes with the as.data.frame() command. This will return a dataframe with the names of the regions as the variable Var1, and their frequencies as a variable Freq. Then we merge the two dataframes into a single one, passing the argument by= to indicate the column by which we want to merge the two objects (in this case Var1), as well as the argument all = TRUE to make sure we are keeping all possible values of Var1 in the merged dataframe.

count_old <- as.data.frame(table(brexit_complete$Region))
count_new <- as.data.frame(table(brexit$Region))

merge(count_old, count_new, by = "Var1", all = TRUE)

So, we have removed all local authorities in Scotland and Northern Ireland, plus a few observations here and there in other regions. Why can this be a problem? Selection bias. Our sample is not representative of the population, because we are excluding observations non-randomly.

How do we get around this? If this were the dataset you were using for your thesis, you would want to collect more data to paint an accurate picture of Leave votes at the national level; as this is just an exercise, we can proceed, as long as we understand that all our conclusions concern an incomplete subsample of English and Welsh local authorities only. This may not be ideal, but more importantly it is absolutely imperative to be clear about what our results are telling us - and where their limitations lie!

First of all, let’s get rid of our original dataframe brexit_complete, so we don’t get confused. We will only be using the brexit dataframe subset from now on.

rm(brexit_complete)

Categorical Variables as Predictors

What would be your best prediction of Leave vote for a local authority knowing only the region in which they are located?

tapply(brexit$Percent_Leave, brexit$Region, mean)

As we noticed before, local authorities in London ‘stand out’ as particularly low on their share of the Leave vote. How would we quantify this effect in a regression setting? We can try and reproduce what we have learnt with continuous variables by creating a ‘London dummy’ variable that takes the value of 1 if the local authority is located in London, and 0 otherwise. The effect of a one-unit change (from 0 to 1) then is tantamount to comparing non-London with London observations:

brexit$ldn_dummy <- NA
brexit$ldn_dummy <- ifelse(brexit$Region == "London", 1 , 0 )

lm_ldn <- lm(Percent_Leave ~ ldn_dummy, data = brexit)
summary(lm_ldn)

How do we interpret this substantively? A local authority in London can be predicted to have a Leave vote of 56.1403 (the intercept) + (−16.3347 × 1) (the slope of the ldn_dummy variable) = 39.8056. We add (−16.3347 × 1) to the intercept because all London local authorities take the value of 1 for our ldn_dummy variable. A local authority outside of London takes the value of 0 for ldn_dummy, so the predicted Leave vote in this model is 56.1403 + (−16.3347 × 0) = 56.1403 - that is, exactly the intercept.
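Rather than doing this arithmetic by hand, you can ask R for the fitted values with predict(). Here is a sketch on simulated data (the numbers below are made up, not the Brexit estimates):

```r
# Simulated 0/1 dummy with a known gap between the two groups
set.seed(1)
d <- data.frame(dummy = rep(c(0, 1), each = 50))
d$y <- 56 - 16 * d$dummy + rnorm(100)

fit <- lm(y ~ dummy, data = d)
predict(fit, newdata = data.frame(dummy = c(0, 1)))
# first value = intercept; second = intercept + slope
```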

What if we want to predict the Leave Vote in non-London local authorities more accurately, given that we have information on the different regions they are located in? Let’s now create one more dummy for the West Midlands, the most ‘Brexit-y’ region in our sample, and add this variable to our lm_ldn model, to estimate our first regression with two independent variables:

brexit$wm_dummy <- NA
brexit$wm_dummy <- ifelse(brexit$Region == "West Midlands", 1 , 0 )

lm_ldn_wm <- lm(Percent_Leave ~ ldn_dummy + wm_dummy, data = brexit)
summary(lm_ldn_wm)

What would our best guess of Percent_Leave vote be for a local authority, knowing only that it’s in the West Midlands? It would be the intercept plus the slope for wm_dummy because these are observations that take the value of 1 for wm_dummy and 0 for ldn_dummy, hence 55.7091 + (−15.9035 × 0) + (4.5893 × 1) = 60.2984. And if it’s in London? It would be the intercept plus the slope for ldn_dummy because these are observations that take the value of 0 for wm_dummy and 1 for ldn_dummy, hence 55.7091 + (−15.9035 × 1) + (4.5893 × 0) = 39.8056. And if it’s neither in London nor in the West Midlands?

To regress on a categorical variable, we could continue adding dummies for each possible value (“East Midlands”, “North East”, “Wales” etc.) our predictor can take, or simply let R do it all at once:

lm_region <- lm(Percent_Leave ~ Region, data = brexit)
summary(lm_region)

At this stage, this is no different from computing subgroup means. As a local authority can only take a value of 1 for one of the categories, with the dummies for all other categories taking a value of 0 by definition (e.g. it can’t be both London and West Midlands), the predicted value of Percent_Leave will be the intercept plus the coefficient for the Region the local authority is in.

Note also that in the regression output R is apparently ‘dropping’ one Region: East. This is because local authorities in the Region ‘East’ will take a value of 0 for all other dummies, so our best guess for the outcome in the East is exactly equal to the intercept. When we include categorical variables in a regression, this will always happen: one category will serve as reference category and the coefficients for the other categories will indicate the expected difference between that category and the reference category. Therefore, in the regression above, we can interpret the value of the intercept 56.941 as the predicted value of Percent_Leave for local authorities in the East, and the coefficient −17.135 as the expected difference in Percent_Leave between a local authority in the East and a local authority in London.
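You can verify this logic on simulated data: with a categorical predictor, the intercept equals the mean of the reference category, and each coefficient equals the difference between that category's mean and the reference mean (the region names below are just labels for the simulation, not the Brexit data):

```r
# Three simulated groups with different means
set.seed(2)
region <- factor(rep(c("East", "London", "Wales"), each = 30))
y      <- c(rnorm(30, 57), rnorm(30, 40), rnorm(30, 52))

fit   <- lm(y ~ region)
means <- tapply(y, region, mean)

coef(fit)[1]              # equals means["East"] (the reference category)
coef(fit)["regionLondon"] # equals means["London"] - means["East"]
```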

By default, the reference category of a character variable in R is the first one alphabetically. This might not always be the one that provides the most intuitive or helpful results. You can change the reference category by recoding the variable as a factor variable with as.factor(). In a factor variable, each possible value our variable takes is stored as an integer (1, 2, 3 etc.) as well as a character. Then you can change the levels with the relevel() function. R will use the category with level 1 as the reference category.

brexit$Region <- as.factor(brexit$Region)
levels(brexit$Region)

brexit$Region <- relevel(brexit$Region, ref = "Wales")
levels(brexit$Region)

lm_region <- lm(Percent_Leave ~ Region, data = brexit)
summary(lm_region)

Does changing our reference category change anything about our predictions? Why or why not?


Modelling and Visualising the Effects of a Dummy and a Continuous Variable

In the previous script, you found a positive effect of mean age of a local authority on the Leave vote share in that area by running a bivariate regression:

lm_age <- lm(data = brexit, Percent_Leave ~ age_mean)
summary(lm_age)

How good is our fit? Is there any variable that can help us reduce the sum of squared residuals further? Let’s see if our London dummy can help. To look into this, we can now turn to the residuals of the lm_age model (i.e. the distance between predicted and actual values) and compare them for London and non-London local authorities. We can do this graphically with a boxplot, or by regressing the residuals on our ldn_dummy variable.

residuals <- residuals(lm_age)
boxplot(residuals ~ brexit$ldn_dummy)
lm(residuals ~ brexit$ldn_dummy)

The boxplot and the regression show that the residuals have substantially lower values for local authorities in London than elsewhere. This means that “London” local authorities tend to be below the regression line for the bivariate lm_age model: that is, they have lower values of Percent_Leave than predicted on the basis of their mean age by our bivariate model. Thus, it is probable that by using only age as a predictor, we’re not weeding out variation in Percent_Leave due to the variation in regional differences (“London-ness”, if you will) in our sample. To account for the effects of both age and London simultaneously, we use a multivariate model:

lm_ldn_age <- lm(Percent_Leave ~ ldn_dummy + age_mean, data = brexit)
summary(lm_ldn_age)

We interpret these coefficients as follows: a 1-year increase in the mean age of a local authority leads to a 0.7063 percentage-point increase in predicted Leave vote share, independent of the effect of whether the local authority is in or outside London. A local authority being in London reduces the predicted Leave vote share by 12.7740 percentage points, independent of the effect of local authority mean age. So, our best guess for a local authority in London with a mean age of 50 would be 27.4593 + (−12.7740 × 1) + (0.7063 × 50). What’s our best guess for a local authority outside London with a mean age of 50?

Now, compare the two bivariate models lm_ldn and lm_age with our multivariate model lm_ldn_age.

stargazer(lm_ldn, lm_age, lm_ldn_age, type = "html", 
          out = "ldn_age_models.html")

What happened to our coefficients? What happened to our \(R^2\)?

When we work with multivariate regressions, the more predictors we add to our model, the bigger our R squared gets. This is by design. However, this may not tell us much about the relative performance of various models: you could just throw in hundreds of variables and get marginally higher and higher values of \(R^2\). That’s why it’s often more useful to look at the adjusted R squared goodness-of-fit coefficient, which adds a penalty for each additional predictor when calculating the model fit:

\[ \text{Adj. } R^2 = 1 - \left(1 - R^2\right) \frac{n-1}{n-k-1} \]

where \(k\) refers to the number of predictors in the regression model.
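You can check this formula against R's own computation, since summary() stores both values (a sketch on simulated data, with hypothetical names):

```r
# Two predictors, one of which (x2) is pure noise
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)

s <- summary(lm(y ~ x1 + x2))
k <- 2  # number of predictors

adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)
all.equal(adj_manual, s$adj.r.squared)  # TRUE
```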

Now let’s try and visualise what we have found in the lm_ldn_age model. What are the slope and the intercept for the function that predicts values of Percent_Leave for observations outside London? What are the slope and the intercept for observations in London?

plot(brexit$age_mean, brexit$Percent_Leave)
abline(a = 27.45934, b = 0.7062648)
abline(a = 27.45934 + (-12.77396),
       b = 0.7062648, col = "red")

# note: abline takes two parameters, a = intercept and b = slope;
# because the ldn_dummy variable only takes the values of 0 and 1,
# we can think of the regression line for London observations as a
# line whose intercept is the model intercept plus the coefficient
# for ldn_dummy, and whose slope is the coefficient of age.

# Or, more elegantly:

plot(brexit$age_mean, brexit$Percent_Leave)
abline(a = lm_ldn_age$coefficients[1], b = lm_ldn_age$coefficients[3])
abline(a = lm_ldn_age$coefficients[1] + lm_ldn_age$coefficients[2],
       b = lm_ldn_age$coefficients[3], col = "red")
points(brexit$age_mean[brexit$ldn_dummy == 1],
       brexit$Percent_Leave[brexit$ldn_dummy == 1], col = "red")

Technically, what the multivariate regression is doing is not fitting “two lines” but rather fitting a plane, defined by three dimensions: the London dummy, mean age and Percent Leave. The logic is exactly the same as for the regression lines we plotted in the previous script: it’s the plane that minimises the sum of squared residuals (the maths is a bit trickier though).

scatterplot3d(brexit$ldn_dummy, brexit$age_mean, brexit$Percent_Leave,
              color = brexit$ldn_dummy+1)$plane3d(lm_ldn_age)

The 3-D plot shows Percent Leave on the Z-axis (vertical height), mean age on the Y-axis (depth in perspective), and the London dummy on the X-axis (horizontal length). Note that the latter only takes values of 0 and 1, so the “cloud” of observations is effectively flattened onto either the right (London local authorities) or the left (non-London local authorities) face of the cube. The two lines corresponding to the left and right sides of the plane running through them are inclined according to the slope of age. The intercept of the plane with the Percent_Leave vertical axis on the left, where the London dummy is 0, represents the intercept of our model. The intercept of the plane with the Percent_Leave vertical axis on the right, where the London dummy is 1, represents the intercept of our model plus the (negative) coefficient for ldn_dummy.


Modelling the Effects of Two Continuous Variables and Regression Anatomy

Now that we have visualised the basic principle of a multivariate regression, let’s generalise this to the case of two continuous independent variables, and look further into the computations going on “behind the scenes” when we call the lm() function with two regressors.

What other things may be related to the Leave vote? Education is something that shows up consistently in analyses of Brexit voting patterns, and a great candidate for a potential ‘omitted variable’ confounding our regression models. That is, we may have reason to believe that some of the variation in the Leave vote share that is being explained e.g. with variation in mean age (or regional variation) may actually be attributable to variation in the omitted variable, education.

We can test for this by checking how the residuals of our original lm_age model correlate with education levels. To operationalise education, we are going to use the Bachelors_deg_percent (share of residents with at least an undergraduate degree). We proceed in two steps:

  1. First, re-run the linear regression without the potential omitted variable. Here, regress the Leave vote share on mean age only.

  2. Then, inspect the relationship between the residuals from that regression and the potentially omitted variable, by either:

• (i) plotting them against the potential omitted variable (here, education),

• (ii) calculating the correlation between the residuals and the potential omitted variable, or

• (iii) running a regression of the residuals on the potential omitted variable.

A positive/negative slope, a strong correlation, or a large coefficient estimate indicates that the potential omitted variable (education) and the error term from the bivariate regression are correlated, i.e. the model likely suffers from omitted variable bias.
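Here is the same diagnostic on simulated data where we know the omitted variable by construction (z; all names are hypothetical, not the Brexit variables):

```r
# z confounds the x-y relationship and is omitted from the short model
set.seed(8)
z <- rnorm(300)                      # the (known) omitted variable
x <- 0.6 * z + rnorm(300)            # z is correlated with the regressor...
y <- 1 + 2 * x - 3 * z + rnorm(300)  # ...and with the outcome

short <- lm(y ~ x)                   # regression that omits z
res   <- residuals(short)

cor(res, z)                          # strongly negative: omitted variable bias
coef(lm(res ~ z))
```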

Reveal Answer

Let’s go back to our original age-only bivariate model, and extract the residuals with the residuals() function.

lm_age <- lm(data = brexit, Percent_Leave ~ age_mean)
residuals <- residuals(lm_age)

Now let’s look at the relationship between the residuals of our age-only model and the education variable:

plot(residuals ~ brexit$Bachelors_deg_percent)
cor(residuals, brexit$Bachelors_deg_percent)
lm(data = brexit, residuals ~ Bachelors_deg_percent)

Levels of education are strongly and negatively related to the residuals of our bivariate regression, so we want to add education alongside age in our model:

lm_age_edu <- lm(data = brexit, Percent_Leave ~ age_mean +
Bachelors_deg_percent)
summary(lm_age_edu)



In a multivariate regression, we interpret each \(\beta_i\) as the change in \(\hat{Y}\) relative to a one unit-change in \(X_i\), holding all other independent variables constant (i.e., when the remaining independent variables are held at the same value or are fixed). How do you substantively interpret each coefficient in this case?

Let’s compare this to the age-only and age-and-education models side by side using stargazer, with some added parameters to make it look nicer:

stargazer(lm_age, lm_age_edu, type = "html",
          out = "age_edu_models.html",
          covariate.labels = c("Mean Age",
                               "Share Degree-Holders",
                              "Intercept"),
          digits = 3, dep.var.labels = "Leave Vote Share")

Can you add the bivariate model to the table? What has happened to our coefficient for mean age? What has happened to our \(R^2\)?

And this is what the regression plane looks like in three dimensions.

scatterplot3d(brexit$age_mean,
              brexit$Bachelors_deg_percent,
              brexit$Percent_Leave,
              angle = 250)$plane3d(lm_age_edu)

Now, let’s try and understand where our coefficients for age and education are coming from. So far we have interpreted them as the effect of an independent variable on the dependent variable net of the effects of the other independent variable. More formally, we can show that the multivariate regression coefficient of any individual variable can be obtained by first netting out the effect of other variable(s) in the regression model from both the dependent variable and the independent variable. (This is known as the Frisch-Waugh theorem.)

Let’s prove this with a step-by-step solution for the age_mean coefficient of our multivariate model lm_age_edu.

• (i) First, net out the effect of Bachelors_deg_percent on age_mean by deriving the residuals of a bivariate regression where age_mean is the dependent variable and Bachelors_deg_percent is the independent variable:

partial1 <- lm(age_mean ~ Bachelors_deg_percent, data = brexit)
partial1
residuals_partial1 <- residuals(partial1)

• (ii) Secondly, net out the effect of Bachelors_deg_percent on Percent_Leave by deriving the residuals of a bivariate regression where Percent_Leave is the dependent variable and Bachelors_deg_percent is the independent variable:

partial2 <- lm(Percent_Leave ~ Bachelors_deg_percent, data = brexit)
partial2
residuals_partial2 <- residuals(partial2)

• (iii) Finally, regress the residuals of Percent_Leave (after netting out the effect of Bachelors_deg_percent) on the residuals of age_mean (after netting out the effect of Bachelors_deg_percent).

full_model <- lm(residuals_partial2 ~ residuals_partial1)

summary(full_model)

The coefficient is, indeed, the same as the age_mean coefficient in our multivariate model fit lm_age_edu. What we’ve learnt is that, in short, we can break down any multivariate regression into an iteration of bivariate regressions.
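The equality holds exactly, which you can verify on any simulated data set (hypothetical names below):

```r
# Frisch-Waugh check: partialling out x2 reproduces the full-model slope on x1
set.seed(6)
x1 <- rnorm(300)
x2 <- 0.5 * x1 + rnorm(300)          # x1 and x2 are correlated
y  <- 2 + 1.5 * x1 - x2 + rnorm(300)

full <- lm(y ~ x1 + x2)
r_x1 <- residuals(lm(x1 ~ x2))       # x1 net of x2
r_y  <- residuals(lm(y ~ x2))        # y net of x2
part <- lm(r_y ~ r_x1)

all.equal(unname(coef(full)["x1"]), unname(coef(part)[2]))  # TRUE
```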

How would you reproduce the coefficient for Bachelors_deg_percent from the same model using this method?


A Final Exercise with our Brexit Dataset

First, create a dummy variable that takes the value of 1 if the local authority has an above-average share of residents who report being in very good health, and 0 otherwise.

Reveal Answer
brexit$hea_dummy <- NA
brexit$hea_dummy[brexit$Health_very_good_percent >
                  mean(brexit$Health_very_good_percent)] <- 1
brexit$hea_dummy[brexit$Health_very_good_percent <=
                  mean(brexit$Health_very_good_percent)] <- 0



Now regress Leave vote on this new dummy and the London dummy we created earlier:

Reveal Answer
lm_ldn_hea <- lm(data = brexit, Percent_Leave ~ ldn_dummy +
                  hea_dummy)

summary(lm_ldn_hea)



• What Leave vote share would the model predict for a local authority that has a below-average share of residents in very good health and is located outside London?

Reveal Answer
60.7085 #the value of the intercept



• What Leave vote share would the model predict for a local authority that has an above-average share of residents in very good health and is located outside London?

Reveal Answer
60.7085 - 10.3918
#the intercept plus the "health dummy" coefficient



• What Leave vote share would the model predict for a local authority that has a below-average share of residents in very good health and is located in London?

Reveal Answer
60.7085 - 12.8577
#the intercept plus the "London dummy" coefficient



• What Leave vote share would the model predict for a local authority that has an above-average share of residents in very good health and is located in London?

Reveal Answer
60.7085 - 10.3918 - 12.8577
# the intercept plus the "London dummy" coefficient
# AND the "health dummy" coefficient.



Now compute the sample means of Percent_Leave for all these subgroups.

Reveal Answer
mean(brexit$Percent_Leave[brexit$ldn_dummy == 0 & brexit$hea_dummy == 0])
mean(brexit$Percent_Leave[brexit$ldn_dummy == 1 & brexit$hea_dummy == 0])
mean(brexit$Percent_Leave[brexit$ldn_dummy == 0 & brexit$hea_dummy == 1])
mean(brexit$Percent_Leave[brexit$ldn_dummy == 1 & brexit$hea_dummy == 1])

#or, more elegantly:

tapply(brexit$Percent_Leave, list(London = brexit$ldn_dummy,
                                  Health = brexit$hea_dummy), mean)



Why are our predicted values not the same as the observed mean values? What’s the difference between this regression with two dummies and our regression model with the London and West Midlands dummies?
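As a hint, here is a sketch on simulated data: an additive model with two dummies has three parameters but there are four subgroup means, so its predictions need not hit the cell means; a "saturated" model that adds the interaction term reproduces them exactly (all numbers below are made up):

```r
# Two simulated dummies with a genuine interaction between them
set.seed(7)
d1 <- rbinom(400, 1, 0.3)
d2 <- rbinom(400, 1, 0.5)
y  <- 50 - 10 * d1 + 5 * d2 + 8 * d1 * d2 + rnorm(400)

additive  <- lm(y ~ d1 + d2)   # three parameters, four cells
saturated <- lm(y ~ d1 * d2)   # adds the d1:d2 interaction

cells <- expand.grid(d1 = 0:1, d2 = 0:1)
cbind(cells,
      additive  = predict(additive, cells),
      saturated = predict(saturated, cells),
      observed  = as.vector(tapply(y, list(d1, d2), mean)))
```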


Plots in ggplot

As our plots become more complex and interesting, it might be worth learning how to reproduce them in ggplot2, which normally takes fewer lines of code and employs a more intuitive language. More resources on this package are in the additional material for script 2. It’s entirely up to you whether you want to look into this - for this class, plots in base R are perfectly fine.

To install and load the package:

install.packages("tidyverse")
library(tidyverse)

Horizontal boxplot of Leave Vote by region:

ggplot(data = brexit,
       mapping = aes(x = Region, y = Percent_Leave, fill = Region)) +
  geom_boxplot() + theme_minimal() + ylab("Percent Leave") +
  xlab("Region") + coord_flip() + theme(legend.position = "none") +
  ggtitle("Leave Vote by Region") +
  theme(plot.title = element_text(hjust = 0.5))

For the 2-dimensional age and London scatterplot:

ggplot(data = brexit, mapping = aes(x = age_mean, y = Percent_Leave,
                                    col = as.factor(ldn_dummy),
                                    group = as.factor(ldn_dummy))) +
  geom_point() +
  scale_color_manual(values = c("blue", "orange"),
                     labels = c("Outside London", "In London"),
                     name = "Local Authority") +
  geom_abline(intercept = 27.45934, slope = 0.7062648, col = "blue") +
  geom_abline(intercept = 27.45934 + (-12.77396),
              slope = 0.7062648, col = "orange") +
  ylab("Percent Leave") + xlab("Mean Age") +
  theme_minimal() +
  ggtitle("Leave Vote by Age and Location of Local Authority") +
  theme(plot.title = element_text(hjust = 0.5))

Some Notes on Randomisation and Confounding

Confounding variables – variables that are related to both the independent variable and the dependent variable – are perhaps the main challenge in empirical social science research. Crucially, the direction of the bias that confounders introduce is ambiguous ex ante: they can either inflate or deflate estimated coefficients. Since a confounder may even be unobservable (i.e. be a latent variable), it may be difficult to control for it even if we know the true underlying data generating process. If a potential confounder is not the only confounder in the true model, then controlling for that confounder can actually worsen one’s ability to draw valid causal inferences, by reducing the bias in one direction (say, positive) but not the other (negative), or vice versa; this can pull OLS estimates further from the true population parameter.

For more on this, see Jason Seawright, “Regression-based inference: a case study in failed causal assessment” (in Brady and Collier, 2010, 2nd edition), as well as Kevin Clarke, “The phantom menace: omitted variable bias in econometric research” (2005).

Random assignment to treatment solves this problem, since observed and unobserved confounders are balanced between control and treatment groups in expectation. In the exercise that follows, for instance, the treatment variable in our dataset is the timing at which the conditional cash transfer scheme (Progresa) was introduced in a precinct (early or late). Early Progresa precincts are our treatment group. However, researchers must demonstrate that treatment and control groups are balanced with respect to salient pre-treatment characteristics. This involves quantitative balance tests, checking that the means of the pre-treatment characteristics are equivalent for treatment and control groups, but also qualitative observations about how treatment was assigned and why it was (as-if) at random.
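A balance test can be sketched in a couple of lines. On simulated data where assignment really is random, the pre-treatment means are equal in expectation (variable names below are hypothetical, not from the Progresa data):

```r
# Simulated random assignment and a pre-treatment covariate
set.seed(4)
treat      <- rbinom(200, 1, 0.5)              # random assignment
pre_income <- rnorm(200, mean = 100, sd = 15)  # pre-treatment covariate

# the difference in means should be statistically indistinguishable from zero
t.test(pre_income ~ treat)
```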

For more on this, see Dunning, “Natural experiments in the social sciences” (2012).


Exercise: Do Targeted Government Programs Increase Pro-Incumbent Voting?

The data for this week comes from: Ana de la O. (2013). ‘Do Conditional Cash Transfers Affect Voting Behavior? Evidence from a Randomized Experiment in Mexico.’ American Journal of Political Science, 57:1, pp.1-14.

In this exercise, you will revisit some previous topics and refine them using techniques learned in this script. Previously, you looked at treatment and control groups, descriptive statistics and linear regression. Now you will combine these to consider how well randomisation holds up in an actual experiment and to estimate how a supposedly randomly assigned treatment affected political behavior.

The original study looks at whether conditional cash transfers to poor citizens affect voting behavior, specifically whether they boost turnout and support for the incumbent party. In the late 1990s, the Mexican government began a program called Progresa that transferred about US$35 per month to eligible poor families. Crucially, families were randomly assigned to receive these cash transfers either in September 1998 (early) or in January 2000 (late). National elections were held in July 2000. The randomisation was conducted at the village level: a government agency first randomly assigned villages to early or late Progresa, and all eligible families in a village then received the disbursement accordingly. In this dataset, we have information at the precinct level (precincts may contain many villages), which is the smallest unit for which electoral results are published. To avoid problems of comparability, every precinct in our dataset contains only one village, randomly assigned either early or late Progresa.

The author, following the distributive politics literature, expected this conditional cash transfer (CCT) program to mobilise voters, leading to an increase in turnout and support for the incumbent party (PRI in this case). The outcome variable is PRI vote share in the 2000 election (pri2000v) and the treatment is early or late Progresa (treatment == 1 if the village received early Progresa).

You can download the data above. Set your working directory and read in the .csv file. The names and descriptions of variables in the data set are below. Each observation in the data represents a precinct.

| Variable Name | Variable Description |
|:---|:---|
| `treatment` | Whether an electoral precinct contains a village where households received early Progresa |
| `pri2000v` | Official PRI vote share in the 2000 election |
| `pri1994` | Total PRI votes in the 1994 presidential election |
| `avgpoverty` | Precinct average of the village poverty index |
| `pri2000s` | PRI votes in the 2000 election as a share of precinct population above 18 |
| `t2000r` | Official turnout in the 2000 election |
| `t1994r` | Official turnout in the 1994 election |
| `pobtot1994` | Total population in the precinct |
| `pan1994` | Total PAN votes in the 1994 presidential election |
| `prd1994` | Total PRD votes in the 1994 presidential election |
| `pri1994s` | Total PRI votes in the 1994 election as a share of precinct population above 18 |
| `pan1994s` | Total PAN votes in the 1994 election as a share of precinct population above 18 |
| `prd1994s` | Total PRD votes in the 1994 election as a share of precinct population above 18 |
| `pri1994v` | Official PRI vote share in the 1994 election |
| `pan1994v` | Official PAN vote share in the 1994 election |
| `prd1994v` | Official PRD vote share in the 1994 election |
| `votos1994` | Total votes cast in the 1994 presidential election |
| `t1994` | Turnout in the 1994 election as a share of precinct population above 18 |
| `t2000` | Turnout in the 2000 election as a share of precinct population above 18 |
| `villages` | Number of villages in the precinct |
rm(list=ls())

setwd("~/Desktop/DataAnalysisR/script4")

cct <- read.csv("cct.csv") #load data and call it "cct", conditional cash transfer

summary(cct)

Task 1

The main outcome variable that you will be using here is pri2000v: the incumbent PRI party’s official vote share in the 2000 election measured at the precinct level. What variables might be good predictors of PRI support in 2000?

• Run a regression of pri2000v on avgpoverty. Interpret the coefficient.

• Next, run a regression of pri2000v on pri1994v. Interpret the coefficient.

• How might the coefficients change when including both predictors in a multivariate regression model? Why?
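A minimal sketch of the two bivariate models, assuming the data have been read in as `cct` above (the model object names `m_pov` and `m_past` are our own):

```r
m_pov  <- lm(pri2000v ~ avgpoverty, data = cct)  # PRI 2000 vote share on poverty
m_past <- lm(pri2000v ~ pri1994v, data = cct)    # PRI 2000 vote share on past PRI support

summary(m_pov)   # slope: change in 2000 PRI vote share per one-unit change in the poverty index
summary(m_past)  # slope: change in 2000 PRI vote share per percentage point of 1994 PRI vote share
```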

Task 2

If previous PRI support and average poverty are themselves correlated, a bivariate regression on either variable alone may yield biased estimates.

• Are pri1994v and avgpoverty correlated?

• Run a regression of pri2000v on avgpoverty. Save the residuals and plot the residuals (y-axis) against the fitted values (x-axis). Add a regression line to the plot that fits the relationship between the residuals and the fitted values.

• Next, add a regression line to the plot that represents the relationship between the residuals and pri1994v. Is there a relationship in the residuals?
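One way to sketch this in base R, again assuming `cct` is loaded (the object name `m_pov` is ours). Note that the first regression line is flat by construction, since OLS residuals are uncorrelated with the fitted values:

```r
m_pov <- lm(pri2000v ~ avgpoverty, data = cct)

plot(fitted(m_pov), residuals(m_pov),
     xlab = "Fitted values", ylab = "Residuals")
abline(lm(residuals(m_pov) ~ fitted(m_pov)), col = "blue")  # flat by construction

# Does pri1994v explain what the poverty-only model leaves unexplained?
plot(cct$pri1994v, residuals(m_pov),
     xlab = "PRI vote share 1994", ylab = "Residuals")
abline(lm(residuals(m_pov) ~ cct$pri1994v), col = "red")
```

A clear slope in the second plot suggests that pri1994v captures systematic variation the poverty-only model misses.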

Task 3

How would including both average poverty and previous PRI support affect the OLS estimates?

• Run a regression using both avgpoverty and pri1994v to predict the PRI vote share in 2000. Interpret the coefficients in light of previous estimates of the separate effects of avgpoverty and pri1994v.
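A sketch of the multivariate model (assuming `cct` is loaded; `m_both` is our own object name):

```r
m_both <- lm(pri2000v ~ avgpoverty + pri1994v, data = cct)
summary(m_both)  # each coefficient now holds the other predictor constant
```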

Task 4

You will now consider the main treatment in the study: whether a precinct received Progresa early (treated) or late (control). If treatment assignment is independent of pre-treatment characteristics, then there should be no relationship between any of the pre-treatment variables and treatment assignment. Assess whether this assumption holds true.

• Identify the mean values of avgpoverty for the treatment and control groups, respectively. Do they differ?

• Create a box plot of avgpoverty for the treatment group and one for the control group. Do they differ?
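These balance checks might look as follows in base R, assuming `cct` is loaded and `treatment` is coded 0 (late/control) and 1 (early/treated):

```r
tapply(cct$avgpoverty, cct$treatment, mean)  # group means of poverty by treatment status

boxplot(avgpoverty ~ treatment, data = cct,
        names = c("Control (late)", "Treatment (early)"),  # assumes 0/1 coding
        ylab = "Average poverty index")
```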

Task 5

What is the effect of early treatment on support for the PRI in 2000?

• Run a regression of pri2000v on treatment.

• Given what you just learned about the relationship between treatment and avgpoverty, how would you expect avgpoverty to affect the regression model? If the treatment and control groups are balanced with respect to avgpoverty, what should the coefficient be for avgpoverty in a regression that also includes treatment?

• Regress pri2000v on treatment and avgpoverty. Interpret the coefficients.
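A sketch of the two models (assuming `cct` is loaded; the object names are ours):

```r
m_t  <- lm(pri2000v ~ treatment, data = cct)               # difference in mean PRI vote share
m_tp <- lm(pri2000v ~ treatment + avgpoverty, data = cct)  # adds the poverty control

summary(m_t)
summary(m_tp)  # if the groups are balanced on avgpoverty, the treatment coefficient should barely move
```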

Task 6

In a natural experiment, treatment is independent of observed and unobserved characteristics in expectation. Just as you did with avgpoverty, evaluate this claim with respect to previous support for the incumbent party (pri1994v). Use tapply() and boxplot(). Are treatment and control groups balanced?

• How would including the previous PRI variable affect the OLS estimate for the treatment? Do you expect the coefficient on treatment to change?

• Run a regression of pri2000v on treatment and pri1994v. Interpret the coefficient. Does early Progresa lead to greater support for the incumbent party?
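The balance check and the regression for this task might be sketched as follows (assuming `cct` is loaded; `m_t94` is our own name):

```r
tapply(cct$pri1994v, cct$treatment, mean)   # group means of 1994 PRI support
boxplot(pri1994v ~ treatment, data = cct, ylab = "PRI vote share 1994")

m_t94 <- lm(pri2000v ~ treatment + pri1994v, data = cct)
summary(m_t94)
```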

Task 7

Finally, evaluate a model with key control variables in the dataset. Namely, regress pri2000v on treatment, pri1994v, pan1994v, prd1994v, avgpoverty, t1994r and pobtot1994 (total precinct population). How does the coefficient on the treatment variable change when control variables are added? What effect did the policy intervention have on support for the incumbent party?
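A sketch of the full specification, shown side by side with the bivariate treatment model using stargazer() (introduced at the top of this session), assuming `cct` is loaded and `pobtot1994` is the population variable referred to above:

```r
library(stargazer)

m_t    <- lm(pri2000v ~ treatment, data = cct)
m_full <- lm(pri2000v ~ treatment + pri1994v + pan1994v + prd1994v +
               avgpoverty + t1994r + pobtot1994, data = cct)

stargazer(m_t, m_full, type = "text")  # text output for the console; use "html" or "latex" for documents
```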